Abstract. Chia and Nakano (2009) introduced the concept of $\mathcal{M}$-decomposability
of probability densities in one dimension. In this paper, we generalize
$\mathcal{M}$-decomposability to any dimension. We prove that all elliptical unimodal
densities are $\mathcal{M}$-undecomposable. We also derive an inequality to show that it
is better to represent an $\mathcal{M}$-decomposable density via a mixture of unimodal
densities. Finally, we demonstrate the application of $\mathcal{M}$-decomposability to
clustering and kernel density estimation, using real and simulated data. Our
results show that $\mathcal{M}$-decomposability can be used as a non-parametric
criterion to locate modes in probability densities.



1. Introduction 

In a recent paper, Chia and Nakano (2009) conceptualized $\mathcal{M}$-decomposability
and developed the theory in one dimension. The main results are summarized in
the following paragraph.

$\mathcal{M}$-decomposability is defined as follows. Let $f$ be a probability density defined
in one dimension. There exist countless ways to express $f$ as a weighted mixture
of two probability densities, in the form of

$$ f(x) = \alpha\, g(x) + (1 - \alpha)\, h(x), \quad \text{where } 0 < \alpha < 1 . $$

If it is possible to find any combination of $\{\alpha, g, h\}$ which satisfies

$$ \sigma_f > \sigma_g + \sigma_h , \quad \text{where } \sigma_f \text{ denotes the standard deviation of } f, $$

then the original density $f$ is said to be $\mathcal{M}$-decomposable. Otherwise, $f$ is
$\mathcal{M}$-undecomposable. Intuitively, multimodal densities with peaks separated far apart
are likely to be $\mathcal{M}$-decomposable. Conversely, unimodal densities are probably
$\mathcal{M}$-undecomposable. The authors proved that all one-dimensional symmetric unimodal
densities with finite second moments are $\mathcal{M}$-undecomposable. In other words, if
$f$ is symmetric unimodal and has finite second moments, then for any weighted
mixture density components $\{g, h\}$ of $f$, one must have

$$ \sigma_f \le \sigma_g + \sigma_h . \tag{1.1} $$

Eq (1.1) applies to a wide range of densities that include Gaussian, Laplace,
logistic and many others. The authors also showed the possibility of using
$\mathcal{M}$-decomposability to perform cluster analysis and mode finding in one dimension.
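
As a rough numerical illustration of the one-dimensional criterion, the sketch below evaluates $\sigma_f$ for a two-component mixture from the component moments and compares it with $\sigma_g + \sigma_h$. The mixture (two unit-variance components centred at $\pm 3$) and the helper name `mixture_std` are assumptions chosen for illustration; they do not come from the paper. A split that satisfies the inequality certifies $\mathcal{M}$-decomposability, whereas a split that fails it says nothing on its own, since the definition quantifies over all possible splits.

```python
import math

def mixture_std(weights, means, stds):
    """Standard deviation of a 1-D mixture, computed from component moments."""
    mean = sum(w * m for w, m in zip(weights, means))
    second = sum(w * (s * s + m * m) for w, m, s in zip(weights, means, stds))
    return math.sqrt(second - mean * mean)

# Hypothetical example: f = 0.5 * g + 0.5 * h, with g and h of unit
# standard deviation and means -3 and +3.
alpha = 0.5
sigma_g, sigma_h = 1.0, 1.0
sigma_f = mixture_std([alpha, 1 - alpha], [-3.0, 3.0], [sigma_g, sigma_h])

# This particular split certifies decomposability when sigma_f > sigma_g + sigma_h.
certifies_decomposable = sigma_f > sigma_g + sigma_h

# Moving the components close together: the same kind of split no longer certifies it.
sigma_f_close = mixture_std([alpha, 1 - alpha], [-0.5, 0.5], [sigma_g, sigma_h])
close_split_certifies = sigma_f_close > sigma_g + sigma_h
```
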
Incidentally, the "$\mathcal{M}$" in $\mathcal{M}$-decomposability may stand for either
"multimodal" or "mixture".

In this paper, we further contribute to $\mathcal{M}$-decomposability, in both its
theoretical and applied aspects. On the theoretical front, we generalize the concept
of $\mathcal{M}$-decomposability to any $d$-dimensional space. First, we derive a theorem
(Theorem 2.3) that is the $d$-dimensional equivalent of Eq (1.1): we prove that all
elliptical unimodal densities with finite second moments are $\mathcal{M}$-undecomposable.
These densities include the multivariate Gaussian, Laplace, logistic and many others.
Following that, we derive another theorem (Theorem 2.4), which determines whether a
given density is better approximated by a mixture of Gaussian densities than by
a single Gaussian density.

One example of application of $\mathcal{M}$-undecomposability is cluster analysis. For
decades, cluster analysis has been a popular research subject, from both the
theoretical and algorithmic aspects. Cluster analysis is likely to remain a widely researched
topic, given the many different approaches that cater to varying applications. The
survey paper by Berkhin (2002) provides an up-to-date status of available clustering
techniques and methodologies. There are two main classes of cluster analysis
methodologies: parametric and non-parametric. For parametric cluster analysis,
one needs prior knowledge or assumptions on the analytical structure of the underlying
clusters. The whole dataset is modeled as a mixture of $k$ parametrized densities,
and the problem reduces to parameter estimation. In McLachlan and Peel (2000),
parametric cluster analysis via the Expectation-Maximization (EM) algorithm is
described in detail. Other parametric methods include the Bayesian particle filter
approach detailed in Fearnhead (2004), and the reversible jump Markov chain
Monte Carlo (MCMC) approach by Richardson and Green (1997). For parametric
cluster analysis, the most popular approach is to model the clusters as Gaussian
densities.

As for non-parametric cluster analysis, a popular tool is the $k$-means algorithm.
The $k$-means algorithm is optimal for locating similar-sized spherical clusters within
a dataset, provided the number of clusters is known beforehand. With elliptical
clusters, or clusters of varying sizes, the $k$-means approach yields results that are
meaningless. The $k$-means algorithm assigns samples to clusters based on distance
(Euclidean or its variations) to the centres of the clusters. Other distance-based
non-parametric clustering algorithms include nearest-neighbour clustering.
Distance-based clustering algorithms generally share the same drawbacks, such
as sensitivity to scaling, to elliptical clusters and to clusters of varying sizes. If the
number of clusters is not known beforehand, neither the $k$-means algorithm nor the
nearest-neighbour algorithm estimates the number of clusters automatically. For
the $k$-means algorithm, the unknown number of clusters has to be re-evaluated via
Akaike's information criterion (AIC), proposed by Akaike (1974), or another suitable
model selection criterion.

Our approach to cluster analysis via $\mathcal{M}$-decomposability is non-parametric and
is based on volume instead of distance. Being non-parametric, it requires no prior knowledge
of the analytical structure of the underlying clusters. The only assumption
required is that the clusters are approximately elliptical and unimodal.
As a result, the limitation of clustering via $\mathcal{M}$-undecomposability is that it will
probably not perform ideally for irregularly shaped clusters that deviate from
elliptical unimodal densities. However, if the clusters are approximately elliptical and
unimodal, then our clustering methodology works well, and allows for the unknown
number of clusters to be recovered automatically. Furthermore, as clustering via
$\mathcal{M}$-decomposability is based on volume instead of distance, cluster allocation is
invariant to scaling.
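
The scaling invariance can be checked numerically. The sketch below assumes the volume measure is the pseudo-volume introduced later in the paper (the square root of the determinant of the covariance matrix) and that cluster decisions depend only on the ratio of the summed cluster pseudo-volumes to the pseudo-volume of the pooled sample. Under an invertible linear map $A$, every pseudo-volume is multiplied by $|\det A|$, so the ratio is unchanged. The data, the matrix `A` and the function names are hypothetical.

```python
import numpy as np

def pseudo_volume(points):
    """Square root of the determinant of the sample covariance matrix."""
    cov = np.atleast_2d(np.cov(np.asarray(points, dtype=float), rowvar=False))
    return float(np.sqrt(np.linalg.det(cov)))

def volume_ratio(g_points, h_points):
    """Summed cluster pseudo-volumes relative to that of the pooled sample."""
    pooled = np.vstack([g_points, h_points])
    return (pseudo_volume(g_points) + pseudo_volume(h_points)) / pseudo_volume(pooled)

rng = np.random.default_rng(0)
g = rng.normal(loc=[-4.0, 0.0], size=(200, 2))
h = rng.normal(loc=[4.0, 1.0], size=(200, 2))

# Hypothetical change of units: an invertible linear map applied to every sample.
A = np.array([[10.0, 0.0], [3.0, 0.5]])
ratio_original = volume_ratio(g, h)
ratio_rescaled = volume_ratio(g @ A.T, h @ A.T)
```

A distance-based rule such as $k$-means would in general assign points differently after such a rescaling, which is the contrast the paragraph above draws.
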

For existing alternative methodologies to clustering, there has been recent
development on Rousseeuw's minimum volume ellipsoids (MVE) in Rousseeuw and Leroy
(1987) and Rousseeuw and van Zomeren (1990). The MVE approach was originally
developed as a robust method to estimate mean vectors and covariance matrices
of multivariate data in the presence of outliers. MVE is computationally intensive
and the optimal solution is often difficult to achieve, prompting many research
papers on the algorithmic aspects of the problem. Some authors, for example
Shioda and Tunçel (2005), outlined a heuristic for clustering via MVE by
minimizing the sum of the volumes of the clusters. Our methodology of clustering via
$\mathcal{M}$-decomposability has some similarities with clustering via the MVE approach, in
that both measure "volume" in a certain sense. Central to the $\mathcal{M}$-decomposability
concept is the "pseudo-volume", which we define as the square root of the determinant
of the covariance matrix. Compared to MVE, the pseudo-volume is computationally
cheap and straightforward. On top of that, we also provide theoretical
justification in Theorem 2.4 for minimizing the sum of pseudo-volumes of clusters.

Another possible area of application of $\mathcal{M}$-undecomposability is density
estimation. In density estimation, data generated from some unknown density are
given, and the task is to estimate and recover the unknown density. One popular
non-parametric approach to density estimation is kernel density estimation, treated
in Silverman (1986), Scott (1992), Härdle et al. (2004), as well as Wand and Jones
(1995). The difficulty in kernel density estimation is the derivation of the optimal
kernel bandwidth: if the kernel bandwidth is underestimated, the kernel density
becomes unduly spiky; if the kernel bandwidth is overestimated, the kernel density
becomes oversmoothed. For multimodal densities, it is not possible to find a single
kernel bandwidth that provides a satisfactory density estimate everywhere.
Using $\mathcal{M}$-decomposability, we demonstrate that there is a simple and logical way to
circumvent the above problem by representing the underlying density as a mixture
of unimodal densities where necessary.

This paper develops both the theoretical and applied aspects of
$\mathcal{M}$-decomposability, and therefore should be of interest to theoretical statisticians and
practitioners alike. Section 2 is devoted to the theoretical development of
$\mathcal{M}$-decomposability in $d$-dimensional space. Readers who are only interested in
applications may note only the results of Theorems 2.3 and 2.4 and skip the rest of
Section 2 without disrupting the flow of the paper.

2. $\mathcal{M}$-Decomposability in $d$-Dimensional Space

2.1. Extensions from One-Dimension. In Chia and Nakano (2009),
$\mathcal{M}$-decomposability involves only the standard deviations of probability densities. This is
because in one dimension, the standard deviation is a natural measure of scatter
of a given density. The standard deviation of any density in one dimension has the
same order as the distance or "length" computed from the mean. When considering
higher dimensions, a possible corresponding measure of scatter of a given density
is the square root of the determinant of the covariance matrix of the density. The
square root of the determinant of the covariance matrix in $d$-dimensional space
has the same order as $d$-dimensional "hypervolume". Henceforth, we shall call the
above measure the pseudo-volume of a density. We denote the covariance matrix
of a density $f$ by $\Sigma_f$, and therefore the pseudo-volume of $f$ is given by
$|\Sigma_f|^{1/2}$. In one dimension, the pseudo-volume reduces to the standard deviation.
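
For sample data, the pseudo-volume can be estimated by plugging in the sample covariance matrix. The following is a minimal sketch under that assumption; the function name `pseudo_volume` and the simulated data are illustrative only, and the unbiased (`ddof=1`) covariance estimate used by `numpy.cov` is a choice the paper does not specify.

```python
import numpy as np

def pseudo_volume(samples):
    """Sample pseudo-volume: sqrt(det(S)), with S the sample covariance matrix.

    `samples` has shape (n, d); a 1-D array is treated as d = 1.
    """
    x = np.asarray(samples, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    # np.cov returns a 0-d array when d = 1, so promote it to a 1 x 1 matrix.
    cov = np.atleast_2d(np.cov(x, rowvar=False))
    return float(np.sqrt(np.linalg.det(cov)))

rng = np.random.default_rng(0)

x1 = rng.normal(size=500)
pv_1d = pseudo_volume(x1)               # coincides with the sample standard deviation
sd_1d = float(np.std(x1, ddof=1))

x3 = rng.normal(size=(500, 3))
pv_3d = pseudo_volume(x3)               # close to 1 for a standard Gaussian sample
```
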

In Chia and Nakano (2009), the authors limited the number of mixture components
to two in their development of $\mathcal{M}$-decomposability. In this paper, we show
that it is possible to relax the above limitation, and generalize the number of
mixture components to $m$, where $m \ge 2$. Let $f$ be a probability density function defined
on $\mathbb{R}^d$, the $d$-dimensional real space. One can always express $f$ as a weighted
mixture of $m$ densities as follows:

$$ f(\mathbf{x}) = \alpha_1 g_1(\mathbf{x}) + \dots + \alpha_m g_m(\mathbf{x}) , \tag{2.1} $$

where $0 < \alpha_i < 1$ and $\sum_i \alpha_i = 1$. Henceforth, we call any set of densities
$\{g_1, \dots, g_m\}$ which satisfies Eq (2.1) a set of mixture components of $f$.

We extend the definition of $\mathcal{M}$-decomposability to $d$-dimensional space as follows.

Definition 2.1 ($\mathcal{M}$-Decomposability). For a given probability density function $f$,
if there exists a set of mixture components $\{g_1, \dots, g_m\}$ such that

$$ |\Sigma_f|^{1/2} > |\Sigma_{g_1}|^{1/2} + \dots + |\Sigma_{g_m}|^{1/2} , $$

then $f$ is defined to be $\mathcal{M}$-decomposable. Otherwise, $f$ is $\mathcal{M}$-undecomposable. If
for any set of mixture components $\{g_1, \dots, g_m\}$,

$$ |\Sigma_f|^{1/2} < |\Sigma_{g_1}|^{1/2} + \dots + |\Sigma_{g_m}|^{1/2} , $$

then $f$ is strictly $\mathcal{M}$-undecomposable.

Our new definition of $\mathcal{M}$-decomposability reduces to that presented in Chia and Nakano
(2009) when $m = 2$ and $d = 1$. For $d \ge 2$, the definition of $\mathcal{M}$-decomposability can
be described compactly using pseudo-volumes.
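
To make Definition 2.1 concrete, the sketch below computes the covariance matrix of a mixture from the weights, means and covariances of its components, using the standard within-plus-between decomposition, and then compares pseudo-volumes. The two-component example in $d = 2$ and all function names are assumptions made for illustration. As in one dimension, exhibiting one set of components that satisfies the inequality is enough to establish $\mathcal{M}$-decomposability.

```python
import numpy as np

def mixture_covariance(weights, means, covs):
    """Covariance of sum_i w_i g_i from component moments.

    Uses Sigma_f = sum_i w_i Sigma_i + sum_i w_i (mu_i - mu)(mu_i - mu)^T.
    """
    w = np.asarray(weights, dtype=float)        # shape (m,)
    mu = np.asarray(means, dtype=float)         # shape (m, d)
    sig = np.asarray(covs, dtype=float)         # shape (m, d, d)
    centred = mu - w @ mu
    within = np.einsum("i,ijk->jk", w, sig)
    between = np.einsum("i,ij,ik->jk", w, centred, centred)
    return within + between

def pv(cov):
    """Pseudo-volume of a covariance matrix."""
    return float(np.sqrt(np.linalg.det(cov)))

# Hypothetical components: two unit-covariance densities centred at (-3, 0) and (3, 0).
weights = [0.5, 0.5]
means = [[-3.0, 0.0], [3.0, 0.0]]
covs = [np.eye(2), np.eye(2)]

cov_f = mixture_covariance(weights, means, covs)   # equals diag(10, 1)
lhs = pv(cov_f)                                    # pseudo-volume of f
rhs = sum(pv(c) for c in covs)                     # summed component pseudo-volumes
components_certify_decomposable = lhs > rhs
```
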

2.2. Elliptical Uniform Densities. The uniform density is trivially defined in
one dimension, but in higher dimensions, it may assume many different possible
shapes. For example, one may think of the uniform hypercube or the uniform
hypersphere. However, the subject of interest in our paper is the elliptical uniform
density, which forms the fundamental building block of elliptical unimodal densities.

Ellipticity, uniformity and unimodality are three different qualities. The definitions
of the first two are given immediately below, and the third will be given in
Section 2.3.

Definition 2.2 (Elliptical and Spherical Densities). We say that $f$ is elliptical if
there exist a vector $\mu \in \mathbb{R}^d$, a positive semidefinite symmetric matrix
$\Sigma \in \mathbb{R}^{d \times d}$ and a positive function $p$ on $\mathbb{R}^+ \cup \{0\}$ such that

$$ f(\mathbf{x}) = p\{(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\} . $$

Furthermore, if $\Sigma = k I_d$, where $k > 0$ and $I_d$ denotes the $d$-dimensional identity
matrix, then $f$ becomes

$$ f(\mathbf{x}) = p_1\{(\mathbf{x} - \mu)^T (\mathbf{x} - \mu)\} = p_2(|\mathbf{x} - \mu|) , $$

and we say that $f$ is spherical.

Remark. The mean and covariance matrix of the above-defined elliptical density
$f$ are as follows:

$$ \mu_f = \mu , \qquad \Sigma_f = c\, \Sigma \quad \text{where } c > 0 . $$

Definition 2.3 (Uniform Densities). We say that $f$ is elliptical uniform if there
exist a vector $\mu \in \mathbb{R}^d$, a positive semidefinite symmetric matrix
$\Sigma \in \mathbb{R}^{d \times d}$, and a positive real number $r$ such that

$$ f(\mathbf{x}) \propto I_{(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) < r^2} , $$

where $I$ denotes the indicator function. Furthermore, if $\Sigma = k I_d$, where $k > 0$ and
$I_d$ denotes the $d$-dimensional identity matrix, then $f$ becomes

$$ f(\mathbf{x}) \propto I_{(\mathbf{x} - \mu)^T (\mathbf{x} - \mu) < r'^2} = I_{|\mathbf{x} - \mu| < r'} , $$

and we say that $f$ is spherical uniform.

Theorem 2.1 (Inequality on Elliptical Uniform Densities). All elliptical uniform
densities defined on $\mathbb{R}^d$ are $\mathcal{M}$-undecomposable in $d = 1$ and strictly
$\mathcal{M}$-undecomposable for $d \ge 2$.

The proof of Theorem 2.1 requires the following lemma.

Lemma 2.1 (Density with Minimum Pseudo-volume). Let $f$ be a probability density
function defined on $\mathbf{x} \in \mathbb{R}^d$ such that $f(\mathbf{x}) \le M_f$ for all $\mathbf{x}$. Then

$$ |\Sigma_f|^{1/2} \ge \frac{\Gamma\!\left(\frac{d}{2} + 1\right)}{M_f \{\pi (d + 2)\}^{d/2}} . $$

Identity holds if and only if $f$ is elliptical uniform with $\max(f) = M_f$.

Remark. When $d = 1$, we recover $\sigma_f \ge 1/(M_f \sqrt{12})$, the result obtained in
Chia and Nakano (2009).
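
The bound in Lemma 2.1 is easy to evaluate numerically. The sketch below checks two things under stated assumptions: that the bound reduces to $1/(M_f\sqrt{12})$ when $d = 1$, and that it is attained by a spherical uniform density on a ball of radius $r$, whose covariance is taken to be $\frac{r^2}{d+2} I_d$ (the expression used later in the paper) and whose height is the reciprocal of the ball's volume. The function names and the radius are illustrative.

```python
import math

def min_pseudo_volume(max_density, d):
    """Lower bound of Lemma 2.1: Gamma(d/2 + 1) / (M * (pi * (d + 2)) ** (d / 2))."""
    return math.gamma(d / 2 + 1) / (max_density * (math.pi * (d + 2)) ** (d / 2))

def uniform_ball_height(r, d):
    """Height of the uniform density on a d-ball of radius r (1 / volume)."""
    volume = math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)
    return 1.0 / volume

def uniform_ball_pseudo_volume(r, d):
    """Pseudo-volume of that density, from covariance r^2 / (d + 2) * I_d."""
    return (r * r / (d + 2)) ** (d / 2)

r = 1.7  # arbitrary radius, for illustration only
gaps = [
    abs(min_pseudo_volume(uniform_ball_height(r, d), d) - uniform_ball_pseudo_volume(r, d))
    / uniform_ball_pseudo_volume(r, d)
    for d in range(1, 7)
]
bound_1d = min_pseudo_volume(2.0, 1)   # d = 1 with M_f = 2
```
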



The proof of Lemma 2.1 has been relegated to Section 5.2 of the appendix
to enhance the flow of the paper. We use the result of Lemma 2.1 to prove
Theorem 2.1.

Proof of Theorem 2.1. Let $u$ be an elliptical uniform density on $\mathbf{x} \in \mathbb{R}^d$ ($d \ge 1$).
We need to prove that for any set of mixture components $\{v_1, \dots, v_m\}$ of $u$,

$$ |\Sigma_{v_1}|^{1/2} + \dots + |\Sigma_{v_m}|^{1/2} \ge |\Sigma_u|^{1/2} . $$

Without loss of generality, set $\max(u) = M$ and therefore

$$ |\Sigma_u|^{1/2} = \frac{\Gamma\!\left(\frac{d}{2} + 1\right)}{M \{\pi (d + 2)\}^{d/2}} . $$

Rewriting the elliptical uniform density $u$ as mixture components, we have

$$ u(\mathbf{x}) = \alpha_1 v_1(\mathbf{x}) + \dots + \alpha_m v_m(\mathbf{x}) $$

for some $\{\alpha_1, \dots, \alpha_m\}$ satisfying $0 < \alpha_j < 1$ and $\sum_j \alpha_j = 1$. As a result, we have

$$ v_j(\mathbf{x}) \le \frac{u(\mathbf{x})}{\alpha_j} \le \frac{M}{\alpha_j} $$

for all $1 \le j \le m$. Using Lemma 2.1, we have

$$ |\Sigma_{v_j}|^{1/2} \ge \frac{\alpha_j\, \Gamma\!\left(\frac{d}{2} + 1\right)}{M \{\pi (d + 2)\}^{d/2}} = \alpha_j |\Sigma_u|^{1/2} \tag{2.2} $$

for all $j$, with equalities holding if and only if the density in question is elliptical
uniform. Now, for $d > 1$, at most $(m - 1)$ but never all of the $v_j$'s can be
elliptical uniform and attain equality in Eq (2.2). Therefore,

$$ |\Sigma_{v_1}|^{1/2} + \dots + |\Sigma_{v_m}|^{1/2} > |\Sigma_u|^{1/2} . $$

Identity may only hold when $d = 1$; refer to Chia and Nakano (2009). □



2.3. Elliptical Unimodal Densities. In one dimension, symmetry is trivial to
visualize and express mathematically. In higher dimensions, symmetry may be
depicted via ellipticity. As such, elliptical unimodal densities play a key role in this
paper. We provide a definition for elliptical unimodal densities below. Elliptical
densities in general have been treated in detail by many researchers; see Fang et al.
(1990) and references within. Unimodal densities have also been the subject of
active research. For example, refer to Anderson (1955), Dharmadhikari and Joag-Dev
(1987), as well as Ibragimov (1956).



Definition 2.4 (Elliptical Unimodal Densities). We say that $f$ is elliptical unimodal
if there exist a vector $\mu \in \mathbb{R}^d$, a positive semidefinite symmetric matrix
$\Sigma \in \mathbb{R}^{d \times d}$ and a non-increasing positive function $p$ on $\mathbb{R}^+ \cup \{0\}$ such that

$$ f(\mathbf{x}) = p\{(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu)\} . $$

Compared with Definition 2.2, the only additional requirement in Definition 2.4
is that the positive function $p$ has to be non-increasing as well. According to
Definition 2.4, elliptical unimodal densities are those whose cross-sections are elliptical,
with mean $\mu$ and covariance matrix proportional to $\Sigma$. Definition 2.4
encompasses a large class of general densities including the $d$-dimensional elliptical
uniform, Gaussian, logistic, Laplace, von Mises, beta$(k, k)$ where $k > 1$, Student-$t$,
and many other densities.

We now propose the following alternative representation of elliptical unimodal
densities.

Theorem 2.2 (Representation of Elliptical Unimodal Densities). Let $f$ be an elliptical
unimodal density with mean $\mu$ and covariance matrix $\Sigma$. Then, for all $\epsilon > 0$,
it is possible to construct a density

$$ g_n(\mathbf{x}) = \sum_{j=1}^{n} b_j u_j(\mathbf{x}) $$

such that

$$ \int |g_n(\mathbf{x}) - f(\mathbf{x})| \, d\mathbf{x} < \epsilon . $$

Here, each $u_j$ is an elliptical uniform density such that

$$ u_j(\mathbf{x}) \propto I_{(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) < r_j^2} \tag{2.3} $$

and the $r_j$'s are strictly positive. Furthermore, each proportionality constant $b_j$ is
proportional to the hypervolume of the support of $u_j$, that is, $b_j \propto r_j^d$.

From the above representation, each elliptical uniform component is weighted
proportionally to the hypervolume of its cross-section. The original elliptical
unimodal density is "sliced latitudinally" into elliptical uniforms with a prefixed
constant "thickness". The proof of Theorem 2.2 has been relegated to Section 5.3 of
the appendix.

2.4. A Theorem on Elliptical Unimodal Densities. 

Theorem 2.3 (Inequality on Elliptical Unimodal Densities). Let $f$ be an elliptical
unimodal density with finite second moments. Then, for any set of mixture
components $\{g_1, \dots, g_m\}$,

$$ |\Sigma_f|^{1/2} \le |\Sigma_{g_1}|^{1/2} + \dots + |\Sigma_{g_m}|^{1/2} . $$

Identity is possible only when $f$ is uniform in one dimension.

Proof. Our task is to prove that for all mixture components $\{g_1, \dots, g_m\}$ satisfying

$$ f(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i g_i(\mathbf{x}) , \tag{2.4} $$

where $0 < \alpha_i < 1$ and $\sum_i \alpha_i = 1$, we must have

$$ |\Sigma_f|^{1/2} \le |\Sigma_{g_1}|^{1/2} + \dots + |\Sigma_{g_m}|^{1/2} . \tag{Claim 1} $$

Using Theorem 2.2, we can approximate $f$ to an arbitrary level of accuracy by
rewriting $f$ as a finite mixture of elliptical uniform densities, each having "uniform
thickness", as

$$ f(\mathbf{x}) = \sum_{j=1}^{n} b_j u_j(\mathbf{x}) . \tag{2.5} $$

The "thickness" of each elliptical uniform component is equal to $\max\{b_j u_j(\mathbf{x})\}$.
Here, the $u_j$'s, described in Eq (2.3), are elliptical uniform densities sharing the
same mean and whose covariances are multiples of each other. Each constant of
proportionality, denoted by $b_j$, is proportional to the hypervolume of the
corresponding elliptical uniform density $u_j$.

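To provide a link between Eqs (2.4) and (2.5), we further rewrite $f$ as

$$ f(\mathbf{x}) = \sum_{i=1}^{m} \sum_{j=1}^{n} c_{i,j}\, v_{i,j}(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i g_i(\mathbf{x}) = \sum_{j=1}^{n} b_j u_j(\mathbf{x}) . $$

For each pair $(i, j)$ above, $c_{i,j}\, v_{i,j}(\mathbf{x})$ is the "intersection" of the segments
$\alpha_i g_i(\mathbf{x})$ and $b_j u_j(\mathbf{x})$ with respect to $f$ on the curve. For all values of $i$ and $j$,
$g_i$ and $u_j$ can be expressed in terms of $v_{i,j}$:

$$ \alpha_i g_i(\mathbf{x}) = \sum_{j=1}^{n} c_{i,j}\, v_{i,j}(\mathbf{x}) , \qquad b_j u_j(\mathbf{x}) = \sum_{i=1}^{m} c_{i,j}\, v_{i,j}(\mathbf{x}) . \tag{2.6} $$

Here, depending on the mixture components $\{g_1, \dots, g_m\}$, it is possible for some of
the $c_{i,j}$'s to be 0, as long as for all values of $\{i, j\}$, we have

$$ \alpha_i = \sum_{j=1}^{n} c_{i,j} > 0 , \qquad b_j = \sum_{i=1}^{m} c_{i,j} > 0 . $$

If $c_{i,j} > 0$ for a pair $(i, j)$, then $v_{i,j}(\mathbf{x})$ is a density. From Eq (2.6), we can rewrite
each elliptical uniform $u_j$ as

$$ u_j(\mathbf{x}) = \sum_{i=1}^{m} \frac{c_{i,j}}{b_j}\, v_{i,j}(\mathbf{x}) . $$

Following the argument presented in Theorem 2.1, we have

$$ |\Sigma_{v_{i,j}}|^{1/2} \ge \frac{c_{i,j}}{b_j}\, |\Sigma_{u_j}|^{1/2} , $$

with equality holding if and only if $v_{i,j}$ is elliptical uniform having "thickness"
satisfying

$$ \max\{c_{i,j}\, v_{i,j}(\mathbf{x})\} = \max\{b_j u_j(\mathbf{x})\} . $$

Similarly, rewriting each mixture component $g_i$ in terms of $v_{i,j}$, we obtain

$$ g_i(\mathbf{x}) = \sum_{j=1}^{n} \frac{c_{i,j}}{\alpha_i}\, v_{i,j}(\mathbf{x}) = \sum_{j=1}^{n} s_{i,j}\, v_{i,j}(\mathbf{x}) , \qquad s_{i,j} = \frac{c_{i,j}}{\alpha_i} . $$

Next, we create new spherical unimodal densities $\tilde g_i$ corresponding to each $g_i$
to facilitate lower bounding of $|\Sigma_{g_i}|$. Define $\tilde g_i$ as follows:

$$ \tilde g_i(\mathbf{x}) = \sum_{j=1}^{n} \frac{c_{i,j}}{\alpha_i}\, \tilde v_{i,j}(\mathbf{x}) = \sum_{j=1}^{n} s_{i,j}\, \tilde v_{i,j}(\mathbf{x}) . $$

In the above, $\{\tilde v_{i,1}, \dots, \tilde v_{i,n}\}$ are spherical uniforms whose means coincide and
such that

$$ \max\{c_{i,j}\, \tilde v_{i,j}(\mathbf{x})\} = \max\{b_j u_j(\mathbf{x})\} $$

for all $j$, hence yielding

$$ |\Sigma_{\tilde v_{i,j}}|^{1/2} = \frac{c_{i,j}}{b_j}\, |\Sigma_{u_j}|^{1/2} . $$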

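Computing the determinant of the covariance matrix of $g_i$, we have

$$
\begin{aligned}
|\Sigma_{g_i}| &= \left| \left( s_{i,1} \Sigma_{v_{i,1}} + \dots + s_{i,n} \Sigma_{v_{i,n}} \right) + \left( s_{i,1}\, \mu_{v_{i,1}} \mu_{v_{i,1}}^T + \dots + s_{i,n}\, \mu_{v_{i,n}} \mu_{v_{i,n}}^T \right) \right| \\
&\ge \left| s_{i,1} \Sigma_{v_{i,1}} + \dots + s_{i,n} \Sigma_{v_{i,n}} \right| \\
&\ge \left( s_{i,1} |\Sigma_{v_{i,1}}|^{1/d} + \dots + s_{i,n} |\Sigma_{v_{i,n}}|^{1/d} \right)^{d} \\
&\ge \left( s_{i,1} |\Sigma_{\tilde v_{i,1}}|^{1/d} + \dots + s_{i,n} |\Sigma_{\tilde v_{i,n}}|^{1/d} \right)^{d} \\
&= \left| s_{i,1} \Sigma_{\tilde v_{i,1}} + \dots + s_{i,n} \Sigma_{\tilde v_{i,n}} \right| \\
&= |\Sigma_{\tilde g_i}| .
\end{aligned}
$$

The first inequality holds as a result of

$$ |K_1 + K_2| \ge |K_1| , \tag{2.7} $$

where $K_1$ and $K_2$ are both non-negative definite symmetric $d \times d$ matrices. The
second inequality holds because

$$ |K_1 + K_2|^{1/d} \ge |K_1|^{1/d} + |K_2|^{1/d} , \tag{2.8} $$

with identity holding if and only if $K_1$ and $K_2$ are proportional. The proofs of both
Eqs (2.7) and (2.8) can be found in Cover and Thomas (1988). The third inequality
holds as we must have

$$ |\Sigma_{v_{i,j}}| \ge |\Sigma_{\tilde v_{i,j}}| $$

as a direct result of Lemma 2.1. The equality that follows the third inequality is
again a result of Eq (2.8), as all the $\Sigma_{\tilde v_{i,j}}$'s are proportional to the identity matrix. We
have just shown that

$$ |\Sigma_{g_i}| \ge |\Sigma_{\tilde g_i}| $$

for all $g_i$, i.e. the pseudo-volume of each $g_i$ is minimized when $g_i$ is spherical
unimodal. Therefore, a sufficient condition for (Claim 1) is

$$ |\Sigma_f|^{1/2} \le |\Sigma_{\tilde g_1}|^{1/2} + \dots + |\Sigma_{\tilde g_m}|^{1/2} . \tag{Claim 2} $$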



Since $f$ is elliptical unimodal, it is possible to find a corresponding spherical
unimodal density $f_s$ such that the hypervolumes are preserved, i.e. $|\Sigma_{f_s}| = |\Sigma_f|$.
To prove (Claim 2), we only have to deal with the pseudo-volumes of spherical
unimodal densities. We obtain $|\Sigma_{\tilde g_i}|$ as follows:

$$ |\Sigma_{\tilde g_i}| \propto \frac{1}{(d + 2)^d} \left( \frac{c_{i,1}^{1 + 2/d} + \dots + c_{i,n}^{1 + 2/d}}{c_{i,1} + \dots + c_{i,n}} \right)^{d} . $$

Here, we make use of the fact that the covariance of a $d$-dimensional spherical
uniform density defined by

$$ u(\mathbf{x}) \propto I_{|\mathbf{x} - \mu| < r} $$

is given as

$$ \Sigma_u = \frac{r^2}{d + 2}\, I_d , $$

where $I_d$ denotes the identity matrix in $d$-dimensional space. Refer to Eq (5.5).
Similarly,

$$ |\Sigma_{f_s}| \propto \frac{1}{(d + 2)^d} \left( \frac{b_1^{1 + 2/d} + \dots + b_n^{1 + 2/d}}{b_1 + \dots + b_n} \right)^{d} , $$

with the same constant of proportionality, which is fixed by the common "thickness".
Hence, proving (Claim 2) is equivalent to proving

$$ \left( \frac{b_1^{1 + 2/d} + \dots + b_n^{1 + 2/d}}{b_1 + \dots + b_n} \right)^{d/2} \le \left( \frac{c_{1,1}^{1 + 2/d} + \dots + c_{1,n}^{1 + 2/d}}{c_{1,1} + \dots + c_{1,n}} \right)^{d/2} + \dots + \left( \frac{c_{m,1}^{1 + 2/d} + \dots + c_{m,n}^{1 + 2/d}}{c_{m,1} + \dots + c_{m,n}} \right)^{d/2} , \tag{Claim 3} $$

where $b_j = c_{1,j} + \dots + c_{m,j}$ for all $j$. To prove (Claim 3), we just have to invoke
Lemma 2.2 given below a total of $(m - 1)$ times, adding up summands on the
RHS two at a time and maintaining the "$\le$" sign. We are now left with the proof of
Lemma 2.2 to prove Theorem 2.3.



Lemma 2.2. Let $a_i$, $b_i$, $c_i$ be sequences of non-negative real numbers such that for
all $i$, $a_i = b_i + c_i$ and $a_i > 0$. Then the following inequality holds for any positive
integers $d$ and $n$:

$$ \left( \frac{a_1^{1 + 2/d} + \dots + a_n^{1 + 2/d}}{a_1 + \dots + a_n} \right)^{d/2} \le \left( \frac{b_1^{1 + 2/d} + \dots + b_n^{1 + 2/d}}{b_1 + \dots + b_n} \right)^{d/2} + \left( \frac{c_1^{1 + 2/d} + \dots + c_n^{1 + 2/d}}{c_1 + \dots + c_n} \right)^{d/2} . $$

Equality holds if and only if the sequences $a_i$, $b_i$ and $c_i$ are linearly dependent.
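
A quick randomized check can build confidence in the inequality of Lemma 2.2 before reading its proof. The sketch below evaluates both sides for random positive sequences and records the smallest observed gap; it is a sanity check under assumed sampling ranges, not a proof, and the helper name `f` simply mirrors the function used in the proof that follows.

```python
import random

def f(x, d):
    """(sum_i x_i^(1 + 2/d) / sum_i x_i) ** (d / 2) for a positive sequence x."""
    p = 1.0 + 2.0 / d
    return (sum(v ** p for v in x) / sum(x)) ** (d / 2.0)

random.seed(0)
smallest_gap = float("inf")
for _ in range(1000):
    d = random.randint(1, 6)
    n = random.randint(1, 8)
    b = [random.uniform(0.1, 5.0) for _ in range(n)]
    c = [random.uniform(0.1, 5.0) for _ in range(n)]
    a = [bi + ci for bi, ci in zip(b, c)]
    # The lemma asserts f(a) <= f(b) + f(c), i.e. a non-negative gap.
    smallest_gap = min(smallest_gap, f(b, d) + f(c, d) - f(a, d))

# Equality case: c proportional to b (here c = 2b, so a = 3b).
b0 = [0.5, 1.5, 4.0]
equality_gap = f(b0, 3) + f([2 * v for v in b0], 3) - f([3 * v for v in b0], 3)
```
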



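Proof. The proof is similar to that of Chia and Nakano (2009), with the only
difference being in $d$. We proceed in the spirit of Hardy et al. (1988), as well
as Pólya and Szegő (1972). Set $\mathbf{x} = [x_1, \dots, x_n]^T$, $\mathbf{y} = [y_1, \dots, y_n]^T$ and
$\mathbf{z} = [z_1, \dots, z_n]^T$, and similarly for $\mathbf{a}$, $\mathbf{b}$, $\mathbf{c}$. Let $\mathbf{x} = t\,\mathbf{y} + (1 - t)\,\mathbf{z}$, i.e.
$x_i = t\, y_i + (1 - t)\, z_i$ for all $i$. Furthermore, define the function $f$ as follows:

$$ f(\mathbf{x}) = \left( \frac{x_1^{1 + 2/d} + \dots + x_n^{1 + 2/d}}{x_1 + \dots + x_n} \right)^{d/2} , $$

and set

$$ \phi(t) = f\{t\,\mathbf{y} + (1 - t)\,\mathbf{z}\} = f(\mathbf{x}) , $$

where $0 \le t \le 1$. It suffices to prove that $\phi''(t) \ge 0$ for $0 \le t \le 1$. This is an
immediate consequence of Jensen's inequality, as $\phi''(t) \ge 0$ implies

$$ \phi(t) \le t\, \phi(1) + (1 - t)\, \phi(0) , $$

where $\phi(1) = f(\mathbf{y})$ and $\phi(0) = f(\mathbf{z})$. Setting $t = \frac{1}{2}$, we have

$$ f\!\left( \frac{\mathbf{y}}{2} + \frac{\mathbf{z}}{2} \right) \le \frac{1}{2} f(\mathbf{y}) + \frac{1}{2} f(\mathbf{z}) . $$

Setting $\mathbf{y} = \mathbf{b}$ and $\mathbf{z} = \mathbf{c}$, this becomes

$$ f\!\left( \frac{\mathbf{a}}{2} \right) \le \frac{1}{2} f(\mathbf{b}) + \frac{1}{2} f(\mathbf{c}) . $$

However, from the definition of $f$, we must have

$$ f\!\left( \frac{\mathbf{a}}{2} \right) = \frac{1}{2} f(\mathbf{a}) . $$

Therefore $\phi''(t) \ge 0$ implies $f(\mathbf{a}) \le f(\mathbf{b}) + f(\mathbf{c})$ as required. Equality holds if and
only if $\phi''(t) = 0$.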

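We shall begin from the definition of $\phi$ as follows:

$$ \phi(t) = f(\mathbf{x}) = \left( \sum_i x_i^{1 + 2/d} \right)^{d/2} \left( \sum_i x_i \right)^{-d/2} . $$

Differentiating twice with respect to $t$ and rearranging, we have

$$ \frac{\phi''(t)}{\phi(t)} = \frac{d(d + 2)}{4} \underbrace{\left\{ \frac{\sum_k (y_k - z_k)}{\sum_k x_k} - \frac{\sum_k x_k^{2/d} (y_k - z_k)}{\sum_k x_k^{1 + 2/d}} \right\}^2}_{A} + \frac{d + 2}{d} \left( \sum_k x_k^{1 + 2/d} \right)^{-2} \underbrace{\left[ \left( \sum_k x_k^{1 + 2/d} \right) \left\{ \sum_j x_j^{2/d - 1} (y_j - z_j)^2 \right\} - \left\{ \sum_k x_k^{2/d} (y_k - z_k) \right\}^2 \right]}_{B} . $$

The term $A$ is a square and therefore greater than or equal to 0. To
evaluate $B$, we set $p_k^2 = x_k^{1 + 2/d}$ and $q_k^2 = x_k^{2/d - 1} (y_k - z_k)^2$, yielding

$$ B = \left( \sum_k p_k^2 \right) \left( \sum_k q_k^2 \right) - \left( \sum_k p_k q_k \right)^2 \ge 0 $$

via the Cauchy-Schwarz inequality. Therefore we must have

$$ \phi''(t) \ge 0 $$

due to the non-negativity of $x_i$, $y_i$ and $z_i$. Hence, Lemma 2.2 and, consequently,
Theorem 2.3 are proved. □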



As a result of Theorem 2.3, all elliptical unimodal densities with finite second
moments are $\mathcal{M}$-undecomposable. Conversely, any density which is $\mathcal{M}$-decomposable
cannot be elliptical unimodal. One can do better than that. In the next subsection,
we further show that if $f$ is $\mathcal{M}$-decomposable, then there exists an approximation
that represents $f$ via a mixture of Gaussian densities, which improves estimation of $f$.

2.5. Estimation of $\mathcal{M}$-Decomposable Densities.

Theorem 2.4 (Inequality on $\mathcal{M}$-Decomposable Densities). Let $f$ be a probability
density function defined on $\mathbf{x} \in \mathbb{R}^d$. Let $\{g_1, \dots, g_m\}$ be a set of mixture
components of $f$ such that

$$ f(\mathbf{x}) = \alpha_1 g_1(\mathbf{x}) + \dots + \alpha_m g_m(\mathbf{x}) , $$

where $0 < \alpha_i < 1$ and $\sum_i \alpha_i = 1$. Then the following result applies:

$$ |\Sigma_f|^{1/2} > |\Sigma_{g_1}|^{1/2} + \dots + |\Sigma_{g_m}|^{1/2} \;\Longrightarrow\; KL(f \,\|\, \hat f) > KL(f \,\|\, \alpha_1 \hat g_1 + \dots + \alpha_m \hat g_m) . $$

Here, $KL(p \,\|\, q)$ denotes the Kullback-Leibler divergence between densities $p$ and $q$,
given as

$$ KL(p \,\|\, q) = \int p(\mathbf{x}) \log \frac{p(\mathbf{x})}{q(\mathbf{x})} \, d\mathbf{x} . $$

Furthermore, $\hat f$ denotes the Gaussian density whose mean and covariance matrix
coincide with those of $f$, and the $\hat g_i$'s are similarly defined.
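
The statement can be illustrated numerically in one dimension. The sketch below assumes a hypothetical density $f$ that is an equal mixture of two Laplace densities centred at $\pm 4$ with unit scale, builds the single moment-matched Gaussian $\hat f$ and the mixture of moment-matched Gaussians, and approximates both Kullback-Leibler divergences with a midpoint rule. The density, the integration range and all function names are assumptions for illustration only.

```python
import math

def laplace_pdf(x, loc, scale=1.0):
    return math.exp(-abs(x - loc) / scale) / (2.0 * scale)

def normal_pdf(x, mean, var):
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)

def kl_divergence(p, q, lo=-40.0, hi=40.0, steps=40000):
    """Midpoint-rule approximation of KL(p || q) on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        px = p(x)
        if px > 0.0:
            total += px * math.log(px / q(x)) * h
    return total

# Hypothetical density: f = 0.5 * Laplace(-4, 1) + 0.5 * Laplace(4, 1).
def f(x):
    return 0.5 * laplace_pdf(x, -4.0) + 0.5 * laplace_pdf(x, 4.0)

var_g = 2.0                  # variance of a Laplace density with unit scale
var_f = var_g + 4.0 ** 2     # within-component plus between-component variance

def f_hat(x):                # single Gaussian matching the moments of f
    return normal_pdf(x, 0.0, var_f)

def mix_hat(x):              # mixture of Gaussians matching each component's moments
    return 0.5 * normal_pdf(x, -4.0, var_g) + 0.5 * normal_pdf(x, 4.0, var_g)

# Premise of the theorem in d = 1: sigma_f > sigma_g1 + sigma_g2.
premise = math.sqrt(var_f) > 2.0 * math.sqrt(var_g)
kl_single = kl_divergence(f, f_hat)
kl_mixture = kl_divergence(f, mix_hat)
```

In this example the premise holds, and the mixture of moment-matched Gaussians is closer to $f$ in the Kullback-Leibler sense than the single Gaussian, in line with the theorem.
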

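Proof. We only need to prove that

$$ \int f(\mathbf{x}) \log \hat f(\mathbf{x}) \, d\mathbf{x} < \int f(\mathbf{x}) \log\{\alpha_1 \hat g_1(\mathbf{x}) + \dots + \alpha_m \hat g_m(\mathbf{x})\} \, d\mathbf{x} . \tag{Claim A} $$

Now, the RHS of (Claim A)

$$
\begin{aligned}
&= \int \{\alpha_1 g_1(\mathbf{x}) + \dots + \alpha_m g_m(\mathbf{x})\} \cdot \log\{\alpha_1 \hat g_1(\mathbf{x}) + \dots + \alpha_m \hat g_m(\mathbf{x})\} \, d\mathbf{x} \\
&\ge \alpha_1 \int g_1(\mathbf{x}) \log\{\alpha_1 \hat g_1(\mathbf{x})\} \, d\mathbf{x} + \dots + \alpha_m \int g_m(\mathbf{x}) \log\{\alpha_m \hat g_m(\mathbf{x})\} \, d\mathbf{x} \\
&= \alpha_1 \left\{ \log \alpha_1 + \int g_1(\mathbf{x}) \log \hat g_1(\mathbf{x}) \, d\mathbf{x} \right\} + \dots + \alpha_m \left\{ \log \alpha_m + \int g_m(\mathbf{x}) \log \hat g_m(\mathbf{x}) \, d\mathbf{x} \right\} .
\end{aligned}
$$

From the definitions, the probability density function $\hat g(\mathbf{x})$ is given by

$$ \hat g(\mathbf{x}) = (2\pi)^{-d/2} |\Sigma_g|^{-1/2} \exp\left\{ -\tfrac{1}{2} (\mathbf{x} - \mu_g)^T \Sigma_g^{-1} (\mathbf{x} - \mu_g) \right\} , $$

where $\mu_g$ and $\Sigma_g$ denote the mean and covariance matrix of $g$. We obtain

$$ \int g(\mathbf{x}) \log \hat g(\mathbf{x}) \, d\mathbf{x} = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma_g| - \frac{d}{2} . $$

Hence, the RHS of (Claim A)

$$ \ge \alpha_1 \left\{ \log \alpha_1 - \tfrac{1}{2} \log |\Sigma_{g_1}| \right\} + \dots + \alpha_m \left\{ \log \alpha_m - \tfrac{1}{2} \log |\Sigma_{g_m}| \right\} - \frac{d}{2} \log(2\pi) - \frac{d}{2} . $$

Meanwhile,

$$ \text{LHS of (Claim A)} = -\frac{d}{2} \log(2\pi) - \frac{1}{2} \log |\Sigma_f| - \frac{d}{2} . $$

To complete the proof of Theorem 2.4, it suffices to demonstrate that

$$ \alpha_1 \log \frac{|\Sigma_{g_1}|^{1/2}}{\alpha_1} + \dots + \alpha_m \log \frac{|\Sigma_{g_m}|^{1/2}}{\alpha_m} < \log |\Sigma_f|^{1/2} . \tag{Claim B} $$

Using Jensen's inequality, we have

$$
\begin{aligned}
\text{LHS of (Claim B)} &\le \log\left( \alpha_1 \frac{|\Sigma_{g_1}|^{1/2}}{\alpha_1} + \dots + \alpha_m \frac{|\Sigma_{g_m}|^{1/2}}{\alpha_m} \right) \\
&= \log\left( |\Sigma_{g_1}|^{1/2} + \dots + |\Sigma_{g_m}|^{1/2} \right) \\
&< \log |\Sigma_f|^{1/2} = \text{RHS of (Claim B)} ,
\end{aligned}
$$

which completes the proof of Theorem 2.4. □

Figure 1. Original data from multimodal density drawn from
mixture of five logistic densities.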

We summarize the result of Theorem 2.4 as follows. Let $f$ be any density in
$d$-dimensional space. If $f$ is $\mathcal{M}$-decomposable, then by definition, one can find
a set of mixture components of $f$ such that the sum of pseudo-volumes of the
mixture components is less than the pseudo-volume of the original density $f$. From
Theorem 2.3, $f$ cannot belong to the class of elliptical unimodal densities. It is
possible to do better than that. Theorem 2.4 shows that $f$ is better estimated via
a weighted Gaussian mixture than via a single Gaussian density. The Gaussian
components are created by matching the moments of the mixture components of $f$.
The better goodness of fit of the resultant weighted Gaussian mixture estimate
is guaranteed in the Kullback-Leibler sense. It should be noted that the analytical
form of the original density $f$ does not need to be known. In the next section,
we demonstrate the use of Theorems 2.3 and 2.4 in statistical applications, namely
cluster analysis and kernel density estimation.

3. Applications Using $\mathcal{M}$-Decomposability

3.1. Clustering via $\mathcal{M}$-Decomposability: The Power of Two. One straightforward
application of $\mathcal{M}$-decomposability is cluster analysis. Many existing clustering
algorithms divide the dataset into clusters based on the following heuristic:
the within-variances of clusters are minimized while the between-variance is
maximized at the same time. Another variation of this heuristic is to determine
cluster allocations such that a function of the volume of clusters is minimized. In
particular, Shioda and Tunçel (2005) proposed dividing the dataset into $k$ clusters such
that the total sum of the MVE (minimum volume ellipsoid) of the $k$ clusters is globally
minimized. While the details of each algorithm may differ, the underlying idea is
conceptually similar. Theorem 2.4 provides theoretical justification for minimizing
the sum of pseudo-volumes, and therefore supports all similar approaches of existing
algorithms.

Figure 2. Original data split into two mixture components, represented
by two different symbols. The sum of pseudo-volumes of
the mixture components is less than that of original.

Intuitively, the rigorous approach to implement cluster analysis via Theorem 2.4
is to divide the dataset into $k$ ($> 1$) clusters, such that the sum of pseudo-volumes
of all clusters is globally minimized. This approach is computationally infeasible
for a dataset of any reasonable size. To this end, we propose the following alternative
approach that captures the essence of Theorem 2.4 as far as possible. We devise
a split-merge clustering strategy that involves splitting and merging two clusters
at a time. This lowers the overall computational load. We show that with our
approach, the algorithm is able to overcome local minima. Consequently, it is
possible to perform cluster analysis well, even with $k$ ($> 2$) clusters.

Figure 3. Mixture component denoted by (+) in Fig 2 split into two further mixture components.

Figure 4. Mixture component denoted by (o) in Fig 2 split into two further mixture components.

Figure 5. Mixture component denoted by (+) in Fig 3 split into two further mixture components.

Figure 6. Mixture component denoted by (o) in Fig 4 split into two further mixture components.

From the given sample $F = \{X_1, \dots, X_n\}$, we wish to know whether the
sample is $\mathcal{M}$-decomposable. We check whether $F$ can be partitioned into two
clusters such that the sum of the pseudo-volumes of the clusters is less than that of $F$.
We denote by $\{G, H\}$ a partition of $F$ such that

$$G = \{Y_1, \dots, Y_m\}, \qquad H = \{Y_{m+1}, \dots, Y_n\},$$

and $G \cup H = F$, where the $Y$'s are a rearrangement of the $X$'s. We further denote the
sample covariance matrices of $F$, $G$, $H$ by $S_F$, $S_G$ and $S_H$. Our task is to find the
optimal partition $\{G, H\}$ such that

$$|S_G|^{1/2} + |S_H|^{1/2}$$

is globally minimized, and to test this value against $|S_F|^{1/2}$. If

$$\frac{|S_G|^{1/2} + |S_H|^{1/2}}{|S_F|^{1/2}} < 1 + \tau_s, \tag{3.1}$$

where $\tau_s$ is a threshold value close to zero, then we conclude that $F$ is likely to
be $\mathcal{M}$-decomposable. However, if Eq (3.1) is not satisfied, then $F$ is likely to be
$\mathcal{M}$-undecomposable. To make the "splitting process" robust against local minima traps,
it is possible to set the RHS of Eq (3.1) to be greater than 1. Furthermore, taking
into consideration the error due to finite sample sizes, the imperfection of splitting
algorithms, and the restriction of the number of mixture components to
two, we recommend setting $\tau_s$ on the RHS of Eq (3.1) to about 0.05.
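The test in Eq (3.1) can be sketched in a few lines. The following is a minimal illustration, not the authors' implementation: the function names are hypothetical, the sample covariance is assumed to use the divisor $n - 1$, and the partition $\{G, H\}$ is taken as given rather than optimized.

```python
import math
from statistics import NormalDist

def covariance(points):
    """Sample covariance matrix (divisor n - 1) of a list of d-dimensional points."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return [[sum((p[a] - mean[a]) * (p[b] - mean[b]) for p in points) / (n - 1)
             for b in range(d)] for a in range(d)]

def determinant(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    d, det = len(m), 1.0
    for c in range(d):
        pivot = max(range(c, d), key=lambda r: abs(m[r][c]))
        if m[pivot][c] == 0.0:
            return 0.0
        if pivot != c:
            m[c], m[pivot] = m[pivot], m[c]
            det = -det
        det *= m[c][c]
        for r in range(c + 1, d):
            factor = m[r][c] / m[c][c]
            for k in range(c, d):
                m[r][k] -= factor * m[c][k]
    return det

def pseudo_volume(points):
    """Square root of the determinant of the sample covariance matrix."""
    return math.sqrt(max(determinant(covariance(points)), 0.0))

def split_ratio(G, H):
    """Left-hand side of Eq (3.1) for the partition {G, H} of F = G + H."""
    return (pseudo_volume(G) + pseudo_volume(H)) / pseudo_volume(G + H)

def is_probably_decomposable(G, H, tau_s=0.05):
    """True when Eq (3.1) holds for the given partition."""
    return split_ratio(G, H) < 1.0 + tau_s

# Two well-separated blobs: a 5 x 5 grid and a copy shifted by (10, 10).
blob = [(0.1 * i, 0.1 * j) for i in range(5) for j in range(5)]
shifted = [(x + 10.0, y + 10.0) for x, y in blob]
print(split_ratio(blob, shifted), is_probably_decomposable(blob, shifted))

# A single Gaussian-like cloud (grid of normal quantiles) cut in half along x.
q = [NormalDist().inv_cdf((i + 0.5) / 20) for i in range(20)]
left = [(x, y) for x in q[:10] for y in q]
right = [(x, y) for x in q[10:] for y in q]
print(split_ratio(left, right), is_probably_decomposable(left, right))
```

The first partition gives a ratio far below 1, while cutting a single unimodal cloud in half gives a ratio above $1 + \tau_s$, so only the first sample would be split.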

When one concludes that a particular cluster $F$ is probably $\mathcal{M}$-undecomposable,
it is possible to stop at one cluster. However, if $F$ is found to be $\mathcal{M}$-decomposable
into clusters $G$ and $H$, one may repeat the splitting process for $G$ and $H$. The
process is iterated until all clusters are probably $\mathcal{M}$-undecomposable, at which
point the splitting process ends.

Our strategy also includes "merging" of clusters. Once all split
clusters are probably $\mathcal{M}$-undecomposable, we select two clusters at a time and
perform the following test. Let $Q$, $R$ denote the two chosen clusters and $P$
be the union of the two clusters, i.e. $P = Q \cup R$. We then compare the sum of the
pseudo-volumes of $Q$ and $R$ against that of $P$. If

$$\frac{|S_Q|^{1/2} + |S_R|^{1/2}}{|S_P|^{1/2}} > 1 + \tau_m, \tag{3.2}$$

we conclude that $Q$ and $R$ should be merged to form a larger cluster $P$. This
process is repeated until there are no more mergeable clusters left. To prevent
overclustering, we recommend $\tau_m$ to be around $-0.05$.
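As a rough sketch of the merging step, consider the one-dimensional case, where the pseudo-volume reduces to the standard deviation. The function names are hypothetical, and the rule of merging the pair with the largest ratio first is our assumption; the text only requires that pairs be tested until none is mergeable.

```python
from itertools import combinations
from statistics import stdev

def merge_ratio(Q, R):
    """Left-hand side of Eq (3.2) in one dimension (pseudo-volume = sample std)."""
    return (stdev(Q) + stdev(R)) / stdev(Q + R)

def merge_clusters(clusters, tau_m=-0.05):
    """Repeatedly merge the pair with the largest ratio while Eq (3.2) holds."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        ratio, i, j = max((merge_ratio(clusters[i], clusters[j]), i, j)
                          for i, j in combinations(range(len(clusters)), 2))
        if ratio <= 1.0 + tau_m:
            break  # no pair satisfies Eq (3.2) any more
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters

# Two halves of one evenly spread cluster, plus a distant cluster.
left_half = [0.1 * k for k in range(5)]
right_half = [0.1 * k for k in range(5, 10)]
far_away = [10.0 + 0.1 * k for k in range(10)]
result = merge_clusters([left_half, right_half, far_away])
print([len(c) for c in result])
```

Here the two halves are merged back into one cluster, while the distant cluster stays separate.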

We have described a possible algorithm that uses $\mathcal{M}$-decomposability to perform
cluster analysis. The crucial point is to find a partition $\{G, H\}$ such that
$|S_G|^{1/2} + |S_H|^{1/2}$ is made as small as possible. There are many possible approaches to this
task. Finding the global minimum of the sum $|S_G|^{1/2} + |S_H|^{1/2}$ is computationally
infeasible and may be NP-hard. Here, we propose a computationally simpler approach.
At each splitting step, we simply fit a two-component Gaussian mixture to the original
cluster $F$, and then run the EM algorithm to convergence to obtain the partition
$\{G, H\}$. However, we emphasize that the EM approach itself is not critical,
and that other approaches may be used to obtain a reasonable partition
$\{G, H\}$ of $F$ at the splitting step. The main point here is the concept of clustering
via $\mathcal{M}$-decomposability. In the two examples presented below, we show that
it is possible to perform cluster analysis reasonably well using our proposed
algorithm.



3.2. Clustering of Simulated Data. The simulation example provided here is
drawn from a mixture of five logistic densities. The sample $F$ consists of
100 observations from each of five logistic densities with the following means and covariance
matrices:

Figure 7. All six $\mathcal{M}$-undecomposable clusters of the original data,
represented by six different symbols.

Figure 8. Final cluster allocation formed by merging clusters
from Fig 7. Five clusters are recovered faithfully.



Fig 1 shows the original sample $F$. Clustering is performed without knowledge of
either the number of clusters or the functional form of the clusters. At the first split
step, we fit a two-component Gaussian mixture to $F$, and perform EM to obtain the partition
$\{G, H\}$. The result is shown in Fig 2. As Eq (3.1) is satisfied for $F$, $G$, $H$, we split $F$
into $G$ and $H$. This is a case of EM converging to a local minimum, as it is (visually)
unlikely that $G$ and $H$ are meaningful clusters of $F$. However, by Eq (3.1), it
is theoretically better to split $F$ into $G$ and $H$. The theoretical justification
is given in the Kullback-Leibler sense. The splitting process is repeated for $G$ and $H$
and the results are shown in Figs 3 and 4. The splitting process continues until we
arrive at six clusters that are all $\mathcal{M}$-undecomposable (Fig 7). Finally, we begin
the merging process and find that the two clusters $Q$, shown as asterisks (*), and $R$,
shown as circles (o) in Fig 7, satisfy Eq (3.2), where $P = Q \cup R$. The two clusters
are then merged and we are left with the five clusters shown in Fig 8. This example
shows that our algorithm is easy to implement and is robust to local minima.

Figure 9. Final cluster allocation via k-means, represented by
five different symbols. The k-means algorithm fails to recover the clusters
faithfully.

A popular clustering algorithm is the k-means method, which is optimal for
nearly spherical clusters. However, it does not work here because of the presence
of inherently elongated clusters. Even with k = 5, the k-means method does
not achieve a meaningful cluster allocation, as shown in Fig 9. Cluster analysis via
k-means is sensitive to rescaling of axes, because k-means involves comparison of
distances. To improve the performance of k-means analysis, there exist many
preprocessing heuristics, e.g. rescaling the axes such that all axial units or marginal
standard deviations become comparable. For this simulation example, rescaling is
unlikely to improve cluster analysis via k-means because the elongated clusters are not
likely to be eliminated. On the other hand, cluster analysis via $\mathcal{M}$-decomposability
involves comparison of pseudo-volumes instead of distances, and is therefore
invariant to rescaling of axes.
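The invariance can be checked with a short determinant identity. As a sketch in our own notation (the matrix $A$ is not part of the original text), suppose every observation $\mathbf{x}$ is mapped to $A\mathbf{x}$, where $A$ is an invertible matrix; rescaling of axes corresponds to a diagonal $A$. Writing $AF$ for the transformed sample, the ratio tested in Eqs (3.1) and (3.2) is unchanged:

```latex
\[
S_{AF} = A\, S_F A^{T}
\quad\Longrightarrow\quad
|S_{AF}|^{1/2} = |\det A|\, |S_F|^{1/2},
\]
\[
\frac{|S_{AG}|^{1/2} + |S_{AH}|^{1/2}}{|S_{AF}|^{1/2}}
= \frac{|\det A| \bigl( |S_G|^{1/2} + |S_H|^{1/2} \bigr)}{|\det A|\, |S_F|^{1/2}}
= \frac{|S_G|^{1/2} + |S_H|^{1/2}}{|S_F|^{1/2}} .
\]
```

The same argument shows that the ratio is invariant to any invertible linear transformation, not only to rescaling.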

3.3. Clustering of Iris Dataset. Next, we analyze Fisher's Iris dataset via
$\mathcal{M}$-decomposability. The dataset was obtained from Asuncion and Newman (2007).
The dataset consists of 150 four-dimensional observations. The four attributes
are sepal length, sepal width, petal length and petal width, all in centimetres.
There are altogether three classes, namely "Setosa", "Versicolor" and "Virginica",
in the proportion of 50 : 50 : 50.

Figure 10. True Iris data: setosa (asterisk), versicolor (cross), virginica (triangle).

We perform cluster analysis of the dataset via $\mathcal{M}$-decomposability, without
knowledge of the actual number of classes. At the end of the analysis, we find
that there are altogether three classes, in the proportion of 50 : 45 : 55. The
first 50 observations coincide with "Setosa" (no misclassifications). For "Versicolor" and
"Virginica", there are altogether five misclassifications (five "Versicolor" are
mislabeled as "Virginica"). The data are depicted graphically in Fig 10 (true class) and
Fig 11 (estimated class).

Figure 11. Iris data recovered via $\mathcal{M}$-decomposability: setosa (asterisk),
versicolor (cross), virginica (triangle).

Although our analysis results in five misclassifications, our allocation
of "Versicolor" and "Virginica" achieves a smaller pseudo-volume than the "true
class". Denoting the "true" classification of "Versicolor" and "Virginica" by $\{v_1, v_2\}$,
and our estimation by $\{\hat{v}_1, \hat{v}_2\}$ respectively, our estimation yields

$$|\Sigma_{\hat{v}_1}|^{1/2} + |\Sigma_{\hat{v}_2}|^{1/2} \approx 0.01563,$$

as compared to

$$|\Sigma_{v_1}|^{1/2} + |\Sigma_{v_2}|^{1/2} \approx 0.01587.$$

The pseudo-volume of "Versicolor" and "Virginica" combined into a single class is
approximately 0.01799.
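As a quick arithmetic check using only the three pseudo-volumes quoted above, the ratios that enter Eqs (3.1) and (3.2) for the combined "Versicolor" and "Virginica" class are:

```latex
\[
\frac{0.01563}{0.01799} \approx 0.869 \quad \text{(estimated allocation)},
\qquad
\frac{0.01587}{0.01799} \approx 0.882 \quad \text{(true allocation)}.
\]
```

Both values lie below $1 + \tau_s = 1.05$, so the combined class passes the split test, and both lie below $1 + \tau_m = 0.95$, so the two classes would not be merged back, assuming the recommended thresholds.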



3.4. Kernel Density Estimation. Density estimation is an important statistical
tool that is widely used in many scientific and engineering fields. Given raw
measurements or data, the task is to recover the unknown density from which the
original data were generated. The problem statement is as follows. Given $\{X_1, \dots, X_n\}$,
generated from an unknown distribution with density $f$, the task is to
estimate $f$. For simplicity, we consider only univariate density estimation.

In density estimation, it is usually difficult to determine quantitatively the
number of modes in the underlying distribution from the given data alone. In this
respect, Theorem 2.4 can be used for parametric density estimation via Gaussian
mixtures. Besides Gaussian mixtures, a popular approach to density
estimation is the kernel density estimator. The kernel density estimator
approach is non-parametric and is treated in detail in Scott (1992), Silverman (1986),
Wand and Jones (1995) and Härdle et al (2004). The formula for the kernel density
estimator, given data $\{X_1, \dots, X_n\}$, is

$$\hat{f}(x; b) = (nb)^{-1} \sum_{i=1}^{n} K\{(x - X_i)/b\}, \tag{3.3}$$

see, e.g. Wand and Jones (1995). Usually $K$ is chosen to be a unimodal density
that is symmetric about zero, and is called the kernel. The positive number $b$ is
called the bandwidth. Such a formulation ensures that $\hat{f}(x; b)$ is also a density.
One property of the kernel density estimator is that the choice of bandwidth is more
important than the choice of the kernel itself. The optimal choice of the bandwidth
ensures that the density estimate is optimally smoothed. One popular choice
of the bandwidth is

(3.4)

where $\sigma$ is the sample standard deviation of the given data and $n$ denotes the sample
size. One known problem of the bandwidth given in Eq (3.4) is that it works well only for
densities that are approximately symmetric unimodal. For multimodal densities,
the bandwidth tends to produce an oversmoothed density.

Here, we propose an $\mathcal{M}$-decomposability based algorithm to improve kernel
density estimation. As we are only dealing with the univariate case, we consider just
the sorted data $F = \{X_{[1]}, \dots, X_{[n]}\}$. Similar to Section 3.1, we perform clustering
of $F$ via splitting and merging. In one dimension, the splitting process becomes
much simpler as we just have to find $m$ ($2 < m < n - 1$) such that $(\sigma_G + \sigma_H)$ is
minimized.
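The one-dimensional search over $m$ can be sketched as a direct scan of the sorted data. This is an illustration under our own assumptions: the name `best_split` is hypothetical, sample standard deviations (divisor $n - 1$) are used, and $m$ is restricted so that both $G$ and $H$ contain at least two points.

```python
from statistics import stdev

def best_split(data):
    """Return (m, ratio): G is the first m sorted points and H the rest, with m
    chosen to minimize stdev(G) + stdev(H); ratio is that minimum over stdev(F)."""
    xs = sorted(data)
    n = len(xs)
    if n < 4:
        raise ValueError("need at least four points to form two clusters")
    # m runs over 2, ..., n - 2 so that each side has at least two points.
    best_m = min(range(2, n - 1), key=lambda m: stdev(xs[:m]) + stdev(xs[m:]))
    ratio = (stdev(xs[:best_m]) + stdev(xs[best_m:])) / stdev(xs)
    return best_m, ratio

m, ratio = best_split([5.2, 0.0, 0.3, 5.0, 0.1, 5.1, 0.2])
print(m, ratio)
```

The scan recomputes each standard deviation from scratch, which costs $O(n^2)$ overall; running sums of $x$ and $x^2$ would reduce this to $O(n)$ after sorting.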

For clarity of explanation, we assume that the original data $F$ has two clusters,
and that $G = \{X_{[1]}, \dots, X_{[m]}\}$ and $H = \{X_{[m+1]}, \dots, X_{[n]}\}$ are the optimal partition
of $F$. We also have $\sigma_G + \sigma_H < \sigma_F$. As such, we can expect the density estimate
obtained via the weighted mixture of $G$ and $H$ to be better than that obtained from the
original data set. Therefore, one may propose a mixture kernel density estimator $\hat{f}_1$ of $F$ given
as follows:

$$\hat{f}_1(x) = \frac{m}{n}\, \hat{g}(x; b_g) + \frac{n-m}{n}\, \hat{h}(x; b_h),$$

where $b_g$ and $b_h$ are the bandwidths obtained from Eq (3.4) with $(m, \sigma_G)$ and
$(n - m, \sigma_H)$ in place of $(n, \sigma)$, and

$$\hat{g}(x; b_g) = (m b_g)^{-1} \sum_{i=1}^{m} K\{(x - X_{[i]})/b_g\},$$

$$\hat{h}(x; b_h) = \{(n-m) b_h\}^{-1} \sum_{i=m+1}^{n} K\{(x - X_{[i]})/b_h\}.$$

The original kernel density estimator $\hat{f}$ of $F$ is given in Eq (3.3).
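The mixture estimator can be sketched as follows. This is a minimal illustration rather than the authors' code: the Gaussian kernel and the bandwidth rule `sigma * n ** -0.2` are placeholder assumptions (any kernel and the rule of Eq (3.4) can be substituted), and the split index `m` is taken as given.

```python
import math
from statistics import stdev

def gaussian_kernel(t):
    """Standard normal density, used here as the kernel K (an assumption)."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def bandwidth(sample):
    """Placeholder rule b = sigma * n**(-1/5); not necessarily Eq (3.4)."""
    return stdev(sample) * len(sample) ** -0.2

def kde(x, sample, b):
    """Kernel density estimate of Eq (3.3) evaluated at the point x."""
    return sum(gaussian_kernel((x - xi) / b) for xi in sample) / (len(sample) * b)

def mixture_kde(x, data, m):
    """Weighted mixture of the kernel estimates of G (first m sorted points) and H."""
    xs = sorted(data)
    n = len(xs)
    G, H = xs[:m], xs[m:]
    return (m / n) * kde(x, G, bandwidth(G)) + ((n - m) / n) * kde(x, H, bandwidth(H))

data = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2]
print(mixture_kde(2.5, data, 4), kde(2.5, data, bandwidth(data)))
```

In this toy sample the single-bandwidth estimate places visible mass in the empty gap around 2.5, whereas the mixture estimate is essentially zero there.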

As an experiment, we generate a sample of size 1000 from a bimodal density,
with functional form given as

$$f(x) = \frac{0.2}{\cosh^2(x + 2.5)} + \frac{0.3}{\cosh^2(x - 2.5)}.$$
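One way to draw such a sample is to note that $\tfrac{1}{2}\cosh^{-2}(x - \mu)$ is a logistic density with location $\mu$ and scale $1/2$, so the density above is a mixture with weights 0.4 and 0.6. The sketch below uses inverse-CDF sampling; the function names and the seed are our own choices, not the paper's.

```python
import math
import random

def true_density(x):
    """Bimodal density 0.2 / cosh^2(x + 2.5) + 0.3 / cosh^2(x - 2.5)."""
    return 0.2 / math.cosh(x + 2.5) ** 2 + 0.3 / math.cosh(x - 2.5) ** 2

def draw_sample(n, seed=0):
    """Draw n points: pick a component (weights 0.4, 0.6), then invert its CDF."""
    rng = random.Random(seed)
    sample = []
    for _ in range(n):
        mu = -2.5 if rng.random() < 0.4 else 2.5
        u = rng.random()
        while u == 0.0:  # avoid log(0)
            u = rng.random()
        # Logistic quantile with location mu and scale 1/2.
        sample.append(mu + 0.5 * math.log(u / (1.0 - u)))
    return sample

sample = draw_sample(1000)
print(len(sample), sum(sample) / len(sample))
```

The mixture mean is $0.4 \times (-2.5) + 0.6 \times 2.5 = 0.5$, which gives a simple sanity check on the generated sample.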

The "true" density is shown as a solid line in Figs 12 and 13. By simply computing
one single bandwidth $b$ on the whole sample set, we obtain a kernel density
estimator (computed using $\hat{f}$). The result is shown as crosses in Fig 13. By using
$\mathcal{M}$-decomposability and splitting the data into two clusters, we obtain a mixture
kernel density estimator (computed using $\hat{f}_1$). The result is shown as crosses in
Fig 12. From Figs 12 and 13, it is clear that the kernel density estimator computed
using $\mathcal{M}$-decomposability is closer to the true density. In this example, we see a
pronounced effect of oversmoothing (Fig 13) for the kernel density estimator with a
single bandwidth. This is because the original density is bimodal with well-separated
modes. The undesirable effect of oversmoothing is alleviated by implementing
$\mathcal{M}$-decomposability.

Figure 12. True density shown as line; kernel estimate with
$\mathcal{M}$-decomposability shown as crosses.

Figure 13. True density shown as line; kernel estimate with single
bandwidth shown as crosses. Comparing with Fig 12, we see
that for a multimodal density, $\mathcal{M}$-decomposability improves kernel
density estimation.

4. Conclusion

In this paper, we generalized the notion of $\mathcal{M}$-decomposability proposed by
Chia and Nakano (2009) to $d$ dimensions, where $d > 1$. Furthermore, we also
broadened the scope of the definition of $\mathcal{M}$-decomposability to accommodate any number
of mixture components. We also derived two theorems pertaining to
$\mathcal{M}$-decomposability. As a result of the first theorem, all elliptical unimodal densities
are $\mathcal{M}$-undecomposable. Consequently, any density that is $\mathcal{M}$-decomposable
cannot belong to the class of elliptical unimodal densities, which includes many common
densities, such as Gaussian, Laplace, uniform, logistic, etc. The second theorem
goes further to say that if a density is $\mathcal{M}$-decomposable, then it is possible to model
the density better via a weighted mixture of Gaussian densities. The goodness of
fit here is defined in the Kullback-Leibler sense. $\mathcal{M}$-decomposability is closely related
to the modality of probability density functions, and hence the theoretical results
derived in this paper should appeal to theoreticians and practitioners alike.

We proposed $\mathcal{M}$-decomposability as a criterion to determine the modality of a
given density, i.e. whether the density is unimodal or multimodal. A practical application
is non-parametric cluster analysis. Here, one does not need to know the parametric
model for the underlying clusters. The only assumption required is that the
underlying clusters are approximately elliptical and unimodal. In this sense, clustering
via $\mathcal{M}$-decomposability is more flexible and robust than clustering via parametric
models or via k-means. Furthermore, we designed a clustering algorithm which
automatically determines the number of clusters. Our algorithm has been tested
on non-Gaussian cluster examples, as well as the popular Iris dataset. Another
application of $\mathcal{M}$-decomposability is density estimation, for which we devised
a scheme to improve kernel density estimation.

Cluster analysis and kernel density estimation are closely related to statistical
learning. Examples are given in Hastie et al (2001). Therefore, $\mathcal{M}$-decomposability
will also be useful in areas such as independent component analysis [Comon (1994),
Hyvärinen and Oja (2000)], machine learning [Hand et al (2001)], etc. Furthermore,
as $\mathcal{M}$-decomposability has been demonstrated to improve density estimation,
it may also be applied to the improvement of proposal densities in Markov chain
Monte Carlo (MCMC) methodologies [Robert and Casella (2004)] and particle
filtering. For example, in Kotecha and Djuric (2003), a class of particle filters called
Gaussian particle filters was introduced. To represent the prior density at each
time-step, the authors generated particles from the Gaussian density fitted to the
weighted particles representing the previous posterior density. Using Theorem 2.4,
the estimation of the prior density can be improved by fitting a mixture of Gaussian
densities to the weighted particles if necessary, using $\mathcal{M}$-decomposability as
the criterion to determine the fit. Similarly, in Lee and Chia (2002), the authors
used Gaussian densities as proposal densities to generate the next prior density via
MCMC. Using $\mathcal{M}$-decomposability, it is possible to improve the proposal densities,
which in turn enhances mixing and improves the acceptance rates of the sequential
MCMC steps.



5. Appendix 



5.1. Special Orthogonal Matrices. A class of matrices in $d$-dimensional space
satisfying

$$A^{-1} = A^{T}, \qquad |A| = 1,$$

is given the name special orthogonal matrices, and denoted as $SO(d)$. Special
orthogonal matrices include all rotation matrices in $d$-dimensional space. They play
an important role in the proof of Lemma 2.1. The next theorem, which is related to
the representation of special orthogonal matrices, was brought to our attention by
Bernstein (2005).

Theorem 5.1. Let $A \in \mathbb{R}^{d \times d}$, where $d \ge 2$. Then $A \in SO(d)$ if and only if there
exist $m$ such that $1 \le m \le d(d-1)/2$, $\theta_1, \dots, \theta_m \in \mathbb{R}$, and $j_1, \dots, j_m, k_1, \dots, k_m \in
\{1, \dots, d\}$ such that

$$A = \prod_{i=1}^{m} P(\theta_i, j_i, k_i),$$

where

$$P(\theta, j, k) = I_d + (\cos\theta - 1)(E_{j,j} + E_{k,k}) + (\sin\theta)(E_{j,k} - E_{k,j}).$$

Here, $I_d$ denotes the $d$-dimensional identity matrix and $E_{i,j}$ denotes the $d \times d$ matrix
with one at the $(i,j)$-th element and zeros everywhere else.

The proof is given in Farebrother and Wrobel (2002).

Remark. $P(\theta, j, k)$ is a plane or Givens rotation.

Remark. Theorem 5.1 is an extension of Euler's rotation theorem, which is the
case when $d = 3$.
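The structure of $P(\theta, j, k)$ is easy to verify numerically. The sketch below, with hypothetical helper names, builds the matrix from the formula in Theorem 5.1 and multiplies three such rotations in $d = 3$; the product should be orthogonal with determinant 1.

```python
import math

def givens(theta, j, k, d):
    """P(theta, j, k) of Theorem 5.1 as a d x d list of lists (1-based j != k)."""
    P = [[1.0 if r == c else 0.0 for c in range(d)] for r in range(d)]
    c, s = math.cos(theta), math.sin(theta)
    P[j - 1][j - 1] = c   # 1 + (cos(theta) - 1) on the (j, j) entry
    P[k - 1][k - 1] = c   # 1 + (cos(theta) - 1) on the (k, k) entry
    P[j - 1][k - 1] = s   # + sin(theta) * E_{j,k}
    P[k - 1][j - 1] = -s  # - sin(theta) * E_{k,j}
    return P

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def det3(A):
    return (A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
            - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
            + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]))

# A product of D = d(d - 1)/2 = 3 plane rotations in three dimensions.
A = matmul(matmul(givens(0.3, 1, 2, 3), givens(-1.1, 1, 3, 3)), givens(2.0, 2, 3, 3))
print(matmul(A, transpose(A)))
print(det3(A))
```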

5.2. Proof of Lemma 2.1. Without loss of generality, we set the mean of $f$ to
the origin to simplify computations. Next, note that it is possible to apply a linear
transformation to the support space of $f$, such that the transformed density $f^w$
satisfies

$$\Sigma_{f^w} \propto I_d, \qquad |\Sigma_{f^w}| = |\Sigma_f|,$$

where $I_d$ denotes the $d$-dimensional identity matrix. As a result of the linear
transformation, we must also have

$$\max(f^w) = \max(f).$$

Next, we denote by $u$ the density of the spherical uniform that satisfies $\max(u) =
\max(f^w)$. Our goal is then to prove that $|\Sigma_{f^w}| \ge |\Sigma_u|$, with equality holding if and
only if $f^w = u$. In order to facilitate comparisons of pseudo-volumes of $f^w$ and $u$,
we shall construct a spherical density $f^s$ (see Definition 2.2) that satisfies

$$\Sigma_{f^s} = \Sigma_{f^w}.$$

By construction, we have $|\Sigma_{f^s}| = |\Sigma_{f^w}|$ and therefore, an equivalent statement of
our goal is $|\Sigma_{f^s}| \ge |\Sigma_u|$. The steps for the construction of $f^s$ are given in the
following paragraph.

We denote by $f_j$ the resultant probability density function when a rotation
operator $R_j \in SO(d)$ is applied onto the support space of $f^w$. We have

$$\Sigma_{f_j} = \Sigma_{f^w}, \quad \text{and} \quad \mu_{f_j} = 0 = \mu_{f^w}.$$

In other words, the mean and covariance of $f^w$ are invariant to rotation if $\Sigma_{f^w} \propto I_d$.
For any rotation operators $R_i, R_j \in SO(d)$, any weighted mixture of $f_i$ and $f_j$ will
again have the same mean and covariance matrix. Denoting the mixture by $g$, we
have

$$g(\mathbf{x}) = \alpha f_i(\mathbf{x}) + (1 - \alpha) f_j(\mathbf{x}),$$

where $0 < \alpha < 1$. The covariance of $g$ is given by

$$\Sigma_g = \alpha \Sigma_{f_i} + (1 - \alpha) \Sigma_{f_j} + \alpha(1 - \alpha)(\mu_{f_i} - \mu_{f_j})(\mu_{f_i} - \mu_{f_j})^{T} = \Sigma_{f^w}.$$

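In two-dimensional space, a rotation operator can be represented as

$$R^{\theta} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix}.$$

From Theorem 5.1, it is possible to represent any rotation in $d$-dimensional space
as a product of Givens rotations, as shown below:

$$R = R_1^{\theta_1} \cdots R_D^{\theta_D},$$

where $D = d(d-1)/2$. We are ready to construct $f^s$ as follows:

$$f^s(\mathbf{x}) = \left(\frac{1}{2\pi}\right)^{D} \underbrace{\int_0^{2\pi} \cdots \int_0^{2\pi}}_{D \text{ times}} f^w\!\left(R_1^{\theta_1} \cdots R_D^{\theta_D} \mathbf{x}\right) d\theta_1 \cdots d\theta_D. \tag{5.1}$$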

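By construction, $f^s$ is the uniform mixture of all possible rotations of the probability
density function $f^w$ in $d$-dimensional space. To show that $\Sigma_{f^s} = \Sigma_{f^w}$, note that

$$\Sigma_{f^s} = \int \mathbf{x}\mathbf{x}^{T} f^s(\mathbf{x})\, d\mathbf{x}
= \left(\frac{1}{2\pi}\right)^{D} \underbrace{\int_0^{2\pi} \cdots \int_0^{2\pi}}_{D \text{ times}}
\underbrace{\left\{ \int \mathbf{x}\mathbf{x}^{T} f^w\!\left(R_1^{\theta_1} \cdots R_D^{\theta_D} \mathbf{x}\right) d\mathbf{x} \right\}}_{A}
d\theta_1 \cdots d\theta_D.$$

The term $A$ is simply the covariance matrix of the transformed density after applying
the rotation operator $R_1^{\theta_1} \cdots R_D^{\theta_D}$ to the support space of $f^w$. As $\Sigma_{f^w}$ is invariant
to rotation, we have

$$\Sigma_{f^s} = \left(\frac{1}{2\pi}\right)^{D} \underbrace{\int_0^{2\pi} \cdots \int_0^{2\pi}}_{D \text{ times}} \Sigma_{f^w}\, d\theta_1 \cdots d\theta_D = \Sigma_{f^w}.$$

Furthermore, $f^s$ must be spherical as one can easily verify that $f^s(R\mathbf{x}) = f^s(\mathbf{x})$
for any $R \in SO(d)$. On top of these, from Eq (5.1), we have

$$f^s(\mathbf{x}) \le \left(\frac{1}{2\pi}\right)^{D} \underbrace{\int_0^{2\pi} \cdots \int_0^{2\pi}}_{D \text{ times}} \max(f^w)\, d\theta_1 \cdots d\theta_D = \max(f^w). \tag{5.2}$$

We have therefore constructed a spherical density $f^s$ whose covariance matrix is
the same as that of $f^w$. Now we are left with proving that $|\Sigma_{f^s}| \ge |\Sigma_u|$ to complete
the proof of the lemma.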

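We express the covariance matrices of $f^s$ and $u$ by $\Sigma_{f^s} = k_f I_d$ and $\Sigma_u = k_u I_d$.
Our goal will be accomplished if we can prove that $k_f \ge k_u$. From Eq (5.2), we have
$f^s(\mathbf{x}) \le \max(f) = M_f$, and the following are straightforward:

(1) $u(\mathbf{x}) \ge f^s(\mathbf{x})$ for $|\mathbf{x}| \le R$, where $u(\mathbf{x}) = M_f$ throughout.

(2) $u(\mathbf{x}) \le f^s(\mathbf{x})$ for $|\mathbf{x}| > R$, where $u(\mathbf{x}) = 0$ throughout.

Here, $R$ represents the radius of the spherical uniform $u$. Moreover, as $f^s$ and $u$
are both spherical and have means centred at the origin, there exist functions $\tilde{f}^s$
and $\tilde{u}$ such that

$$f^s(\mathbf{x}) = \tilde{f}^s(|\mathbf{x}|) = \tilde{f}^s(r); \qquad u(\mathbf{x}) = \tilde{u}(|\mathbf{x}|) = \tilde{u}(r),$$

using Definition 2.2 and the representation in hyperspherical coordinates. Furthermore,
we define $h(\mathbf{x}) = f^s(\mathbf{x}) - u(\mathbf{x})$. Note that $h(\mathbf{x})$ is not a probability density
function as $h(\mathbf{x})$ takes negative values and

$$\int h(\mathbf{x})\, d\mathbf{x} = 0. \tag{5.3}$$

Using the hyperspherical coordinate representation, there must exist a function $\tilde{h}$
such that $h(\mathbf{x}) = \tilde{h}(|\mathbf{x}|) = \tilde{h}(r)$, and

$$\tilde{h}(r) \begin{cases} \le 0 & \text{for } r \le R; \\ \ge 0 & \text{for } r > R. \end{cases}$$

Note that $\tilde{h}$ is identically 0 if and only if $f^s = u$, or equivalently, $f$ is elliptical
uniform. Now,

$$k_f - k_u = \mathbf{e}_1^{T} (\Sigma_{f^s} - \Sigma_u)\, \mathbf{e}_1
= \int \mathbf{e}_1^{T} \mathbf{x}\mathbf{x}^{T} \mathbf{e}_1 \{f^s(\mathbf{x}) - u(\mathbf{x})\}\, d\mathbf{x}
= \int |\mathbf{e}_1^{T} \mathbf{x}|^2\, h(\mathbf{x})\, d\mathbf{x}.$$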

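Here, $\mathbf{e}_1$ is the unit vector parallel to the first axis. Representation via spherical
coordinates, with $x_1 = r\cos\phi_1$, yields

$$k_f - k_u = \int \cdots \int x_1^2\, \tilde{h}(r)\, r^{d-1} \sin^{d-2}(\phi_1) \cdots \sin(\phi_{d-2})\, dr\, d\phi_1 \cdots d\phi_{d-1}
= \int_0^{\infty} r^{d+1}\, \tilde{h}(r)\, dr \times \Phi_1 \times \cdots \times \Phi_{d-1},$$

with

$$\Phi_1 = \int_0^{\pi} \cos^2(\phi_1) \sin^{d-2}(\phi_1)\, d\phi_1, \qquad \Phi_{d-1} = 2\pi,$$

and the rest of the $\Phi_i$'s ($2 \le i \le d - 2$) satisfying

$$\Phi_i = \int_0^{\pi} \sin^{d-1-i}(\phi_i)\, d\phi_i.$$

Evidently, all $\Phi_i$'s are strictly positive and we only need to prove

$$\int_0^{\infty} r^{d+1}\, \tilde{h}(r)\, dr \ge 0 \tag{$*$}$$

to arrive at the conclusion that $k_f \ge k_u$. Representing Eq (5.3) via hyperspherical
coordinates, we have

$$\int_0^{\infty} r^{d-1}\, \tilde{h}(r)\, dr \times \int_0^{\pi} \sin^{d-2}(\phi_1)\, d\phi_1 \times \Phi_2 \times \cdots \times \Phi_{d-1} = 0,$$

and therefore

$$\int_0^{\infty} r^{d-1}\, \tilde{h}(r)\, dr = 0.$$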

To prove ($*$), we break up the integral as follows:

$$\int_0^{\infty} r^{d+1}\, \tilde{h}(r)\, dr
= \underbrace{\int_0^{R} r^{d-1} r^2\, \tilde{h}(r)\, dr}_{\le 0} + \underbrace{\int_R^{\infty} r^{d-1} r^2\, \tilde{h}(r)\, dr}_{\ge 0}$$

$$\ge \int_0^{R} r^{d-1} R^2\, \tilde{h}(r)\, dr + \int_R^{\infty} r^{d-1} R^2\, \tilde{h}(r)\, dr
= R^2 \times \int_0^{\infty} r^{d-1}\, \tilde{h}(r)\, dr = 0.$$

Equality holds if and only if $\tilde{h} = 0$ identically, implying that $f^s$ is spherical uniform,
or in other words, $f$ is elliptical uniform. This proves $k_f \ge k_u$ and consequently,
we have

$$|\Sigma_f| = |\Sigma_{f^s}| \ge |\Sigma_u|.$$

Finally, we need to show that the pseudo-volume of an elliptical uniform density
$u$ with $\max(u) = M_u$ is given by

$$|\Sigma_u|^{1/2} = \frac{\Gamma\!\left(\frac{d}{2} + 1\right)}{\{(d+2)\pi\}^{d/2}\, M_u}. \tag{5.4}$$

We first compute the covariance matrix of a uniform density on the hypersurface
of the $d$-dimensional sphere. Consider the probability mass function of a discrete
random variable $X$ given below:

$$f_X(\mathbf{x}) = \begin{cases} \dfrac{1}{2d} & \text{if } \mathbf{x} = \pm a\,\mathbf{e}_j,\ j = 1, \dots, d, \\ 0 & \text{otherwise.} \end{cases}$$

The covariance matrix of the above distribution is computed as $a^2 I_d / d$. It is possible
to generate a uniform density on the hypersurface of the $d$-dimensional sphere
of radius $a$ by applying rotations to the discrete random variable given above.
Therefore, the covariance matrix of a uniform density on the hypersurface of a
$d$-dimensional sphere of radius $a$ is $a^2 I_d / d$. By considering a spherical uniform
density as a continuous mixture of hypersurfaces, we obtain the covariance matrix
of a spherical uniform density with radius $r$ as

$$\Sigma_u = \frac{I_d}{d} \cdot \frac{\int_0^{r} a^{d+1}\, da}{\int_0^{r} a^{d-1}\, da} = \frac{r^2}{d+2}\, I_d. \tag{5.5}$$

We therefore obtain the pseudo-volume of a spherical uniform density of radius $r$ as

$$|\Sigma_u|^{1/2} = \frac{r^{d}}{(d+2)^{d/2}}.$$

Using the fact that the volume of a $d$-dimensional sphere of radius $r$ is given by

$$V = \frac{\pi^{d/2}\, r^{d}}{\Gamma\!\left(\frac{d}{2} + 1\right)},$$

and that $M_u = 1/V$, we obtain the required pseudo-volume. Hence, the proof of Lemma 2.1 is complete.

□
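The last step of the proof can be checked numerically. The sketch below, with hypothetical function names, evaluates the pseudo-volume once from the radius via Eq (5.5) and once from the maximum $M_u = 1/V$ via Eq (5.4); the two should agree in every dimension.

```python
import math

def ball_volume(d, r):
    """Volume of a d-dimensional sphere of radius r."""
    return math.pi ** (d / 2) * r ** d / math.gamma(d / 2 + 1)

def pseudo_volume_from_radius(d, r):
    """|Sigma_u|^(1/2) = r^d / (d + 2)^(d/2), which follows from Eq (5.5)."""
    return r ** d / (d + 2) ** (d / 2)

def pseudo_volume_from_max(d, M_u):
    """Right-hand side of Eq (5.4)."""
    return math.gamma(d / 2 + 1) / (((d + 2) * math.pi) ** (d / 2) * M_u)

for d in range(1, 7):
    r = 1.7
    M_u = 1.0 / ball_volume(d, r)  # a uniform density equals 1 / volume on its support
    print(d, pseudo_volume_from_radius(d, r), pseudo_volume_from_max(d, M_u))
```

For $d = 1$ both expressions reduce to $r/\sqrt{3}$, the standard deviation of a uniform density on $[-r, r]$.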

5.3. Proof of Theorem 2.2. We can define the following continuous function on
non-negative values of $y$, for a given $f$:

$$q(y) = \int \min\{f(\mathbf{x}), y\}\, d\mathbf{x}.$$

Then, $q$ is increasing with $q(0) = 0$. If $f$ is unbounded, then $q$ is strictly increasing
for all $y$ with $\lim_{y \to \infty} q(y) = 1$. If $f$ is bounded such that $\max(f) = F$, then $q$ is
strictly increasing for $0 \le y \le F$ and $q(y) = 1$ for all $y \ge F$.

We can rewrite $f$ as a sum of two positive functions in the form

$$f(\mathbf{x}) = f^{(1)}(\mathbf{x}) + f^{(2)}(\mathbf{x}),$$

where $f^{(1)}(\mathbf{x}) = \min\{f(\mathbf{x}), Y\}$ and $Y$ is positive. For a given $\epsilon > 0$, it is always
possible to choose $Y$ such that

$$1 - \frac{\epsilon}{4} < \int f^{(1)}(\mathbf{x})\, d\mathbf{x} = q(Y) \le 1,$$

and therefore

$$0 \le \int f^{(2)}(\mathbf{x})\, d\mathbf{x} < \frac{\epsilon}{4},$$

because $q$ is continuous, ranging between 0 and 1. The above "slicing" ensures that
the function $f^{(1)}$ is bounded from above by $Y$. Let $h = Y/n$. Define a set of real
numbers $\{r_{n,1}, \dots, r_{n,n}\}$ by

$$r_{n,j} = \sup\{r \mid p(r^2) \ge jh\}.$$

Here, $p$ is the non-increasing function defined on $\mathbb{R}^{+} \cup \{0\}$ which satisfies $f(\mathbf{x}) =
p[(\mathbf{x} - \mu)^{T} \Sigma^{-1} (\mathbf{x} - \mu)]$. Setting

n,j 

we can then construct a density $g_n$ such that

$$g_n(\mathbf{x}) = \sum_{j=1}^{n} \alpha_{n,j}\, u_{n,j}(\mathbf{x}).$$

Next, rewrite $g_n$ as a sum of two positive functions in the form of

$$g_n(\mathbf{x}) = g_n^{(1)}(\mathbf{x}) + g_n^{(2)}(\mathbf{x}),$$

where 

n 

J=l 

Here, all three functions $g_n$, $g_n^{(1)}$ and $g_n^{(2)}$ are proportional to one another.
Furthermore, by construction, $g_n^{(1)}$ is dominated everywhere by $f^{(1)}$. We also have

$$0 \le f^{(1)}(\mathbf{x}) - g_n^{(1)}(\mathbf{x}) \le \min\{f(\mathbf{x}), h\} \le h.$$

It is therefore possible to choose $n$ (and hence $h$) such that

$$\int |g_n^{(1)}(\mathbf{x}) - f^{(1)}(\mathbf{x})|\, d\mathbf{x} = \int \{f^{(1)}(\mathbf{x}) - g_n^{(1)}(\mathbf{x})\}\, d\mathbf{x} = d(h) < \frac{\epsilon}{4}.$$

Finally, applying the triangle inequality on integrals, we have

$$\int |g_n(\mathbf{x}) - f(\mathbf{x})|\, d\mathbf{x}
\le \int |g_n^{(1)}(\mathbf{x}) - f^{(1)}(\mathbf{x})|\, d\mathbf{x} + \int g_n^{(2)}(\mathbf{x})\, d\mathbf{x} + \int f^{(2)}(\mathbf{x})\, d\mathbf{x}.$$

The first and third terms on the right-hand side of the inequality are both less than
$\epsilon/4$. The second term is

$$\int g_n^{(2)}(\mathbf{x})\, d\mathbf{x} = 1 - \int g_n^{(1)}(\mathbf{x})\, d\mathbf{x} < 1 - \int f^{(1)}(\mathbf{x})\, d\mathbf{x} + \frac{\epsilon}{4} < \frac{\epsilon}{2}.$$

Hence, we arrive at

$$\int |g_n(\mathbf{x}) - f(\mathbf{x})|\, d\mathbf{x} < \epsilon,$$

completing the proof of Theorem 2.2. □